{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gradient Agreement as an Optimization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Gradient Agreement algorithm is an interesting and recently introduced algorithm that acts as an enhancement to the meta learning algorithms. In MAML and reptile, we seek to find the better model parameter that is generalizable across several related tasks so that we can learn quickly with fewer data points. If we recollect what we have learned in the previous chapters, we have seen that we randomly initialize the model parameter and then we sample a random batch of tasks  $T_i$ from the task distribution $p(T)$.  For each of the sampled tasks, we minimize the loss by calculating gradients get the updated parameters for each of the task $\\theta'_i$  and that forms our inner loop. \n",
    "\n",
    "$\\theta'_i = \\theta - \\alpha \\nabla_{\\theta} L_{T_i}(f_{\\theta}) $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After calculating the optimal parameter for each of the sampled tasks, we perform meta optimization i.e we perform meta optimization by calculating loss in a new set of tasks and we minimize loss by calculating gradients with respect to the optimal parameters $\\theta'_i$  we obtained in the inner loop and update our initial model parameter  $\\theta$,\n",
    "\n",
    "$\\theta = \\theta - \\beta \\nabla_{\\theta} \\sum_{T_i \\sim p(T)} L_{T_i}(f_{\\theta_i}) $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What's really going in the above equation? If you closely examine the above equation, you can notice that we are merely taking an average of gradients across tasks and updating our model parameter $\\theta$  which implies all tasks contribute equally in updating our model parameter. "
   ]
  },
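  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the equal contribution explicit, here is a short expansion. The shorthand $g_i = \\nabla_{\\theta} L_{T_i}(f_{\\theta'_i})$ and the batch size $N$ are notation assumed here for illustration. Since the gradient of a sum is the sum of the gradients, the outer update can be written as\n",
    "\n",
    "$\\theta = \\theta - \\beta \\sum_{i=1}^{N} g_i = \\theta - \\beta N \\cdot \\frac{1}{N} \\sum_{i=1}^{N} g_i $\n",
    "\n",
    "so the step is the average gradient scaled by $\\beta N$, and every $g_i$ enters with the same coefficient."
   ]
  },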
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But what's wrong with this? let us say we have sampled 5 tasks and four tasks have a gradient update in one direction but one task have a gradient update in a direction which completely differs from other tasks. Since the gradient of all the tasks contributes equally in updating the model parameter, this disagreement can have a serious impact on updating the model's initial parameter. As you can see in the below figure, all tasks from $T_1$ to  $T_4$ have a gradient in one direction and but the task $T_5$ has a gradient in a completely different direction compared to other tasks. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__ ![Image to be added](1.png) __"
   ]
  },
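  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the effect in numbers, consider a toy case with assumed, purely illustrative gradient values (they are not taken from any experiment). Let $g_i$ denote the gradient of task $T_i$, with $g_1 = g_2 = g_3 = g_4 = (1, 0)$ and $g_5 = (-2, 0)$. The unweighted sum used in the outer update is then\n",
    "\n",
    "$\\sum_{i=1}^{5} g_i = 4 \\cdot (1, 0) + (-2, 0) = (2, 0) $\n",
    "\n",
    "whereas the four agreeing tasks alone would give $(4, 0)$. In this sketch, a single disagreeing task cancels half of the update that the other four tasks agree on."
   ]
  },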
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So what should we now? How can we understand which task has a strong gradient agreement and which tasks have a strong disagreement? If we associate weights with the gradients can't we understand the importance? So we rewrite our outer gradient update equation by adding the weights multiplied with each of the gradients as,\n",
    "\n",
    "$\\theta = \\theta -\\beta \\sum_i w_i \\nabla L_{T_i}(f_{\\theta_i}) $"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay, how do we calculate these weights? these weights are proportional to the inner product of the gradients of a task and an average of gradients of all the tasks in the sampled batch of tasks. But what does this implies? \n",
    "\n",
    "__It implies if the gradient of a task is in the same direction as the average gradient of all tasks in sampled batch of tasks then we can increase its weights so that it will contribute more in updating our model parameter similarly if the gradient of a task is in the direction which is greatly different from the average gradient of all tasks in sampled batch of tasks then we can decrease its weights so that it will contribute less in updating our model parameter. We will see how exactly these weights are computed in the following section. __\n",
    "\n",
    "We can not only apply our gradient agreement algorithm to MAML but also to the reptile algorithm. So our reptile update equation becomes,\n",
    "\n",
    "$\\theta = \\theta + \\alpha \\sum_i w_i(\\theta'_i - \\theta) $"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
